gradient ascent methods to remain at a local maximum. From the
standpoint of learning theory, however, GW leave open several questions
that can be addressed by a more precise formalization in terms of
Markov structures (a possible formalization suggested but left
unpursued in a footnote of GW). In this paper we explicitly formalize
learning in a finite parameter space as a Markov structure whose states
are parameter settings. Several important results that follow directly
from this characterization include (1) A
1 Introduction: The Triggering Model
as a Markov structure

Recently, Gibson and Wexler ([1], GW) have begun to formalize the
notion of language learning in a (finite) space whose grammars (and
languages) are characterized by a finite number of parameters, or
1-dimensional Boolean-valued arrays, n long. A grammar in this space is
simply a particular n-length array of 0's and 1's; hence there are 2^n
possible grammars (languages). One of Gibson and Wexler's aims is to
establish that under some simple hill-climbing learning regimes, namely
single-step gradient ascent, some linguistically natural, finite spaces
are unlearnable, in the sense that positive-only examples lead to local
maxima: incorrect hypotheses from which a learner can never escape.
More broadly, they wish to show that learnability in such spaces is
still an interesting problem, in that there is a substantive learning
theory concerning feasibility, convergence time, and the like, that
must be addressed beyond traditional linguistic theory and that might
even choose between otherwise adequate linguistic theories.

In this paper, we choose as a convenient starting point their
Triggering Learning Algorithm (TLA) to focus our investigation of
parameter learning. Our central result is that the performance of this
algorithm is completely modeled by a Markov chain. The remainder of the
current paper is devoted to exploring the basic consequences of this
fact.

Let us first review the GW model and the TLA. Following Gold [2], the
basic framework is that of identification in the limit. The learner
(child) starts out in an arbitrary state: some setting of the n
parameter values. The learner receives a (countably infinite) sequence
of positive example sentences drawn from some target language, L_t.
After each presentation, the learner can either (i) stay in the same
state; or (ii) move to a new hypothesis state, using the algorithm
given below. If after some finite number of examples the learner
converges to the correct target language (= parameter settings) and
never changes state, then it has correctly identified the target
language; otherwise, it does not converge.

In addition, in the GW model the language learner obeys two fundamental
constraints: (1) the single-value constraint, under which the learner
can change only 1 parameter value at a time; and (2) the greediness
constraint, under which, if the learner is given a positive example it
cannot recognize (accept), and if the learner changes one parameter
value and finds that it can accept the example, then the learner
retains that new parameter value. Finally, we also recall GW's
definition of a local trigger (minor notational changes aside): given
values for all parameters but one, parameter p_i, a local trigger for
value v of parameter p_i, written p_i(v), is a sentence s from the
target grammar G_t such that s is grammatical iff p_i = v. GW then
state their TLA as follows:

• [Initialize] Step 1. Start at some random point in the (finite)
space of possible parameter settings, specifying a single hypothesized
grammar with its resulting extension as a language;

• [Process input sentence] Step 2. Receive a positive example sentence
s_i at time t_i (examples drawn from the language of a single target
grammar, L(G_t), from a uniform distribution on the language; we shall
be able to relax this distributional constraint later on);

• [Learnability on error detection] Step 3. If the current grammar
parses (generates) s_i, then go to Step 2; otherwise, continue.

• [Single-step gradient-ascent] Select a single parameter at random,
uniformly with probability 1/n, to flip from its current setting, and
change it (0 mapped to 1, 1 to 0) iff that change allows the current
sentence to be analyzed; in either case, go to Step 2.
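The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not GW's program: the names tla_step, run_tla, and the 2-parameter family TOY are hypothetical, and each language is represented extensionally as a set of sentences so that "parses" reduces to set membership.

```python
import random

def tla_step(state, sentence, languages, rng):
    """One TLA update. `state` is a tuple of 0/1 parameter values and
    `languages` maps each state to the set of sentences it generates."""
    if sentence in languages[state]:
        return state                       # Step 3: sentence parses, keep hypothesis
    i = rng.randrange(len(state))          # choose one parameter, probability 1/n
    candidate = state[:i] + (1 - state[i],) + state[i + 1:]
    # Greediness: keep the flip only if it makes the sentence analyzable.
    return candidate if sentence in languages[candidate] else state

def run_tla(start, target, languages, steps, seed=0):
    """Feed `steps` examples drawn uniformly from the target language."""
    rng = random.Random(seed)
    examples = sorted(languages[target])
    state = start
    for _ in range(steps):
        state = tla_step(state, rng.choice(examples), languages, rng)
    return state

# Hypothetical 2-parameter family (an assumption for illustration only).
TOY = {
    (0, 0): {"a"},
    (0, 1): {"a", "b"},        # used as the target below
    (1, 0): {"c"},
    (1, 1): {"a", "b", "c"},   # superset of the target language
}

if __name__ == "__main__":
    print(run_tla((0, 0), (0, 1), TOY, steps=200))
    print(run_tla((1, 1), (0, 1), TOY, steps=200))
```

In this toy family the learner started at (1, 1) never moves, because every target sentence already parses there; this is the kind of incorrect hypothesis from which positive-only evidence offers no escape.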

Of course, this algorithm never halts in the usual sense. GW aim to
show under what conditions this algorithm converges "in the limit":
that is, after some number, n, of steps, where n is unknown, the
correct target parameter settings will be selected and never be
changed. Their central claim is stated as their Theorem 1 (p. 7 in
their manuscript).¹

Theorem 1 As long as the probability is always greater than a lower
bound b (b > 0) that the learner will 1) encounter a local trigger for
some incorrectly-set parameter P, and 2) then reset P accordingly to
the target value, it turns out that the target grammar can always be
learned using the Triggering Learning Algorithm.

1.1  The  Markov  formulation 

From the standpoint of learning theory, however, GW leave open several
questions that can be addressed by a more precise formalization of this
model in terms of Markov chains (a possible formalization suggested but
left unpursued in footnote 9 of GW). We can picture the hypothesis
space, of size 2^n, as a set of points, each corresponding to one
particular array of parameter settings (languages, grammars). Call each
point a hypothesis state or simply state of this space. As is
conventional, we define these languages over some alphabet Σ as subsets
of Σ*. One of them is the target language (grammar). We arbitrarily
place the (single) target grammar at the center of this space. Since by
the TLA the learner is restricted to moving at most 1 binary value in a
single step, the theoretically possible transitions between states can
be drawn as (directed) lines connecting parameter arrays (hypotheses)
that differ by at most 1 binary digit (a 0 or a 1 in some corresponding
position in their arrays). Recall that the number of positions in which
two arrays differ is the so-called Hamming distance.

We may further place weights on the transitions from state i to state j
corresponding to the nonzero b's mentioned in the theorem above; these
correspond to the probabilities that the learner will move from
hypothesis state i to state j. In fact, as we shall show below, given a
distribution over L(G_t), we can further carry out the calculation of
the actual b's themselves. Thus, we can picture the TLA learning space
as a directed, labeled graph V with 2^n vertices.² More precisely, we
can make the following remarks about the TLA system GW describe.

¹ Note that the notion of "trigger" does not enter into the statement
of the TLA or the constraints the TLA employs, but only into the
statement of the theorem.

Remark. The TLA system is memoryless; that is, given a sequence of
sentences up to time t_i, the selection of hypothesis h(t_i) depends
only on sentence s_i (and the preceding hypothesis h(t_{i-1})), and not
(directly) on previous sentences, i.e.,

\[ P\{h(t_i) \mid s_1, \ldots, s_i;\ h(t_0), \ldots, h(t_{i-1})\} = P\{h(t_i) \mid s_i,\ h(t_{i-1})\} \]

In other words, the TLA system is a classical discrete stochastic
process, in particular, a discrete Markov process or Markov chain. We
can now use the theory of Markov chains to describe TLA parameter
spaces [3]. For example, as is well known, we can convert the graphical
representation of an n-state Markov chain M to an n × n matrix T, where
each matrix entry (i, j) represents the transition probability from
state i to state j. A single step of the Markov process is given by T
itself; two steps are computed via the matrix multiplication T × T, and
m steps are given by T^m. A "1" entry in cell (i, j) of T^m in the
limit of large m means that the system will converge with probability 1
to state j, given that it starts in state i.
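As a minimal sketch of how powers of a transition matrix expose multi-step and limiting behavior, the following pure-Python fragment raises a small matrix to a power. The 3-state matrix T is a hypothetical example, not one of the GW chains; state 2 is its only absorbing state.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(t, m):
    """m-step transition probabilities: t multiplied by itself m times."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, t)
    return result

# Hypothetical 3-state chain; each row sums to 1 and state 2 is absorbing.
T = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]

T2 = mat_pow(T, 2)      # two-step probabilities
T200 = mat_pow(T, 200)  # close to the limiting matrix

if __name__ == "__main__":
    print(T2[0])         # [0.25, 0.5, 0.25]
    print(T200[0][2])    # very close to 1: state 0 is eventually absorbed in state 2
```

Repeated multiplication is used here only for transparency; for larger chains, repeated squaring or a linear-algebra library would be the more usual choice.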

As mentioned, not all these transitions will be possible in general.
For example, by the single-value constraint, the system can only move 1
Hamming bit at a time. Also, by assumption, only differences in surface
strings can force the learner from one hypothesis state to another. For
instance, if state i corresponds to a grammar that generates a language
that is a proper subset of the language of another grammar hypothesis
j, there can never be a transition (nonzero b) from j to i, and there
must be one from i to j. Further, by assumption and the TLA, it is
clear that once we reach the target grammar there is nothing that can
move the learner from this state, since all remaining positive evidence
will not cause the learner to change its hypothesis. Thus, there must
be a loop from the target state to itself, with some positive label b,
and no exit arcs. In the Markov chain literature, this is known as an
Absorbing State (AS). Obviously, a state that only leads to an AS will
also drive the learner to that AS. Finally, if a state corresponds to a
grammar that generates some sentences of the target, there is always a
loop from that state to itself that has some nonzero probability.
Clearly, one can conclude at once the following learnability result:

Theorem 2 Given a Markov chain C corresponding to a GW TLA learner,
∃ exactly 1 AS (corresponding to the target grammar/language) iff C is
learnable.

Proof. ⇐. By assumption, C is learnable. Now assume for sake of
contradiction that there is not exactly one AS. Then there must be
either 0 AS or > 1 AS. In the first case, by the definition of an
absorbing state, there is no hypothesis in which the learner will
remain forever. Therefore C is not learnable, a contradiction. In the
second case, without loss of generality, assume there are exactly two
absorbing states, the first S corresponding to the target parameter
setting, and the second S' corresponding to some other setting. By the
definition of an absorbing state, in the limit C will with some nonzero
probability enter S', and never exit S'. Then C is not learnable, a
contradiction. Hence our assumption that there is not exactly 1 AS must
be false.

⇒. Assume that there exists exactly 1 AS i in the Markov chain M.
Then, by the definition of an absorbing state, after some number of
steps n, no matter what the starting state, M will end up in state i,
corresponding to the target grammar. ∎

² GW construct an identical transition diagram in the description of
their computer program for calculating local maxima. However, this
diagram is not explicitly presented as a Markov structure; it does not
include transition probabilities. Of course, topologically both
structures must be identical.

Note that this approach avoids a crucial flaw in the proof given in GW
(pp. 7-8 in manuscript):

    That is, if the learner never goes through the same state twice,
    then she is bound to end up in the target state at some point,
    because the parameter space is finite in size. Thus the probability
    of avoiding the target state forever is equivalent to the
    probability of cycling forever through some ordered set of states
    (a cycle).

    We can divide the parameter space into a finite set of minimal
    cycles, where each minimal cycle contains no cycles as a subpart.
    Because the parameter space is finite, the set of minimal cycles in
    the parameter space is also finite. For each minimal cycle, we can
    now calculate the probability that the learner remains in that
    cycle forever... the probability of staying in the [minimal pm/rcb]
    cycle in the limit (forever) is zero. The same is true for all of
    the finitely-many minimal cycles, so that the probability of
    staying in any of these cycles in the limit is also zero. Thus the
    probability of ending up at the target state in the limit is one.

In brief, GW attempt to show that the probability of the learner
avoiding the target forever is zero by showing that the fact that some
minimal cycle occurs infinitely often makes the probability of the
infinite sequence zero. In other words, every way in which the learner
avoids the target has probability zero. Thus they conclude that the
probability of the event

Event = Learner avoids target forever

is zero; more precisely, they claim

\[ P\Big[\bigcup_n W_n\Big] = 0 \]

where each W_n is a path avoiding the target and ∪_n W_n is the set of
all such paths. However, as is well known, this union computation is
guaranteed to hold only when it is taken over a countable number of
elements. In the example at hand, the crucial omission in the argument
is that there are an uncountable number of ways in which the learner
can avoid the target. This is because there are an uncountable number
of sequences of numbers between 1 and M − 1. The base M − 1 expansion
of any real number in the interval [0, 1) would yield such a sequence
(e.g., consider an irrational expansion such as that of the square root
of 2).

Since there are an uncountable number of ways in which the event of
avoiding the target forever can be realized, the fact that each such
way has probability zero does not imply that the total event has
probability zero as well. To see this, consider a random variable X
with a uniform distribution on [0, 1]. Now consider the event

Event: X < 1/2

There are many ways in which this event could occur, e.g., X = 1/4,
X = 1/3, X = 0.234, etc. Each of these ways has probability zero, i.e.,
P[X = 1/4] = 0, P[X = 1/3] = 0, and so on. However, we know that the
probability of the event X < 1/2 is 1/2, not zero. This is because
there are an uncountable number of ways in which the event X < 1/2
could take place. Thus the proof as given in [1] is incorrect. One
correct way to formulate the proof is by resorting to an explicit
Markov formulation, as suggested but not executed in GW's footnote 9,
and as we established above. A similar conceptual difficulty seemingly
leads to their failure to note that there may be other states besides
local maxima for which convergence may not occur.

Corollary 1 Given a Markov chain corresponding to a (finite) family of
grammars in a GW learning system, if there exist 2 or more AS, then
that family is not learnable.
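The absorbing-state condition can be checked mechanically once a transition matrix is available. The sketch below is illustrative only: the function names and the 4-state matrix CHAIN are assumptions, with state 0 playing the role of the target. It lists the absorbing states and also tests whether the target is reachable at all from a given start state, which picks out non-absorbing states that can nonetheless never reach the target.

```python
def absorbing_states(t, tol=1e-12):
    """Indices i whose self-transition probability is 1 (no exit arcs)."""
    return [i for i, row in enumerate(t) if abs(row[i] - 1.0) < tol]

def can_reach(t, start, goal):
    """True if `goal` is reachable from `start` along nonzero arcs."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        if i == goal:
            return True
        for j, p in enumerate(t[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return False

# Hypothetical chain: state 0 is the target, state 1 a second absorbing
# state, state 2 can only drift into state 1, state 3 can go either way.
CHAIN = [
    [1.0,  0.0, 0.0,  0.0],
    [0.0,  1.0, 0.0,  0.0],
    [0.0,  0.5, 0.5,  0.0],
    [0.25, 0.0, 0.25, 0.5],
]

if __name__ == "__main__":
    print(absorbing_states(CHAIN))   # [0, 1]: two AS, so not learnable
    print(can_reach(CHAIN, 2, 0))    # False: state 2 never reaches the target
    print(can_reach(CHAIN, 3, 0))    # True
```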

Example. 

Consider the GW 3-parameter system. Its binary parameters are: (1)
Spec(ifier) first (0) or last (1); (2) Comp(lement) first (0) or last
(1); and (3) Verb Second (V2) does not exist (0) or does exist (1). By
Specifier GW follow the standard linguistic convention of whether there
is a part of a phrase that "specifies" that phrase, roughly, like the
old in the old book; by Complement GW roughly mean a phrase's
arguments, like an ice-cream in John ate an ice-cream or with envy in
green with envy. There are also 7 possible "words" in this language: S,
V, O, O1, O2, Adv, and Aux, corresponding to Subject, Verb, Object,
Direct Object, Indirect Object, Adverb, and Auxiliary. There are 12
possible surface strings for each (−V2) grammar and 18 possible surface
strings for each (+V2) grammar if we restrict ourselves to unembedded
or "degree-0" examples for reasons of psychological plausibility (see
GW for discussion). Note that the "surface strings" of these languages
are actually phrases such as Subject, Verb, and Object. Figure (3) of
GW summarizes the possible binary parameter settings in this system.
For instance, parameter setting (5) corresponds to the array [0 1 0] =
Specifier first, Comp last, and −V2, which works out to the possible
basic English surface phrase order of Subject-Verb-Object (SVO). As
shown in GW's figure (3), the other possible arrangements of surface
strings corresponding to this parameter setting include S V; S V O1 O2
(two objects, as in give John an ice-cream); S Aux V (as in John is a
nice guy); S Aux V O; S Aux V O1 O2; Adv S V (where Adv is an Adverb,
like quickly); Adv S V O; Adv S V O1 O2; Adv S Aux V; Adv S Aux V O;
and Adv S Aux V O1 O2.


Suppose SVO (setting #5 = [0 1 0]) is the target grammar (language).
With the GW 3-parameter system, there are 2^3 = 8 possible hypotheses,
so we can draw this as an 8-point Markov configuration space, as shown
in the figure above. The shaded rings represent increasing Hamming
distances from the target. Each labeled circle is a Markov state, a
possible array of parameter settings or grammar, and hence
extensionally specifies a possible target language. Each state is
exactly 1 binary digit away from its possible transition neighbors.
Each directed arc between the points is a possible (nonzero) transition
from state i to state j; we shall show how to compute this immediately
below. We assume that the target grammar, a double circle, lies at the
center. This corresponds to the (English) SVO language. Surrounding the
bulls-eye target are the 3 other parameter arrays that differ from
[0 1 0] by one binary digit each; we picture these as a ring 1 Hamming
bit away from the target: [0 1 1], corresponding to GW's parameter
setting #6 in their figure 3 (Spec-first, Comp-final, +V2, basic order
SVO+V2); [0 0 0], corresponding to GW's setting #7 (Spec-first,
Comp-first, −V2), basic order SOV; and [1 1 0], GW's setting #1
(Spec-final, Comp-final, −V2), basic order VOS.

Around this inner ring lie 3 parameter setting hypotheses, all 2 binary
digits away from the target: [0 0 1], [1 0 0], and [1 1 1] (grammars
#2, 3, and 8 in GW figure 3). Note that by the single-value constraint
the learner can only move one grey ring towards or away from the target
at any one step. Finally, one more ring out, three binary digits
different from the target, is the hypothesis [1 0 1], corresponding to
grammar 4.

It is easy to see from inspection of the figure that there are exactly
2 absorbing states in this Markov chain, that is, states that have no
exit arcs. One AS is the target grammar (by definition). The other AS
is state 2. Finally, state 4 is also a sink (a so-called "closed state"
in the Markov terminology) that leads only to state 4 or state 2. These
two states correspond to the local maxima at the head of GW's figure 4.
Hence this system is not learnable. In addition to these local maxima,
the next section below shows that there are in fact other states from
which the learner can never reach the target.

2  Derivation  of  Transition  Probabilities 
for  the  Markov  TLA  Structure 

The transition probabilities can be computed from the language family
by a direct extension of the procedure given in GW. Let the target
language L_t consist of the strings s_1, s_2, ..., i.e.,

L_t = {s_1, s_2, s_3, ...}

Let there be a probability distribution P on these strings. Suppose the
learner is in a state s corresponding to the language L_s. Suppose it
now receives the string s_j. It will do so with probability P(s_j).
There are two cases to examine, depending upon whether or not the
string s_j is analyzable by the grammar corresponding to the current
parameter setting.

Case I. Suppose the learner can syntactically analyze the received
string s_j. By the TLA, it will not change its parameter values. In the
Markov chain formulation, the learner remains in the same state.
Remember that this state corresponds to the language L_s. Also note
that this situation arises only when s_j is in the language L_s.
Therefore the contribution of s_j to the probability of the learner
remaining in the state s is P(s_j).

Case II. Suppose the learner cannot syntactically analyze the string.
Then s_j ∉ L_s. By the TLA, the learner chooses a parameter at random,
flips it, and if the new parameter setting makes s_j analyzable, it
retains this value and moves to the corresponding state; otherwise it
remains in its original state s. Let us examine this situation using
the Markov chain formulation. The learner is in state s. It has n
neighboring states, each at a Hamming distance of 1 from itself. The
learner picks one of these uniformly at random. Imagine that n_j of
these neighboring states correspond to languages which contain s_j. If
the learner picks any one of these n_j states (which of course it does
with probability n_j/n), it would stay in that state. If the learner
picks any of the other states (with probability (n − n_j)/n), then it
remains in state s. Note that n_j of course could be 0, which means
that none of the neighboring states would allow the string to be
analyzed. The maximum value n_j could take is n. Thus we see that the
probability that the learner remains in state s is
P(s_j)((n − n_j)/n). The probability that it moves to each of the other
n_j states is P(s_j)(1/n).

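Clearly this allows us to compute the probability that the learner will
remain in its original state s as the sum of the probabilities of the
above two cases, namely the following expression, in which both sums
range over strings s_j of the target language L_t:

\[ \sum_{s_j \in L_s} P(s_j) + \sum_{s_j \notin L_s} (1 - n_j/n)\, P(s_j) \]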

The above expression is still a little untidy because it has the n_j's
in it. We would like to clean it up a little. To do this, consider the
way we would compute the transition probability from state s to some
other neighboring state, say k, in the chain. From the above analysis,
we see that such a transition will occur with probability 1/n for all
the strings s_j that are in the language L_k but not in the language
L_s. The strings themselves occur with probability P(s_j) each, and so
the transition probability is:

\[ P[s \to k] = \sum_{s_j \in (L_t \cap L_k) \setminus L_s} (1/n)\, P(s_j) \]

Note that the above summation is done over all strings
s_j ∈ (L_t ∩ L_k) \ L_s, where \ is the set difference symbol. It is
easy to see that

\[ s_j \in (L_t \cap L_k) \setminus L_s \iff s_j \in (L_t \cap L_k) \setminus (L_t \cap L_s). \]

Thus we can rewrite the transition probability as

\[ P[s \to k] = \sum_{s_j \in (L_t \cap L_k) \setminus (L_t \cap L_s)} (1/n)\, P(s_j) \]

Since we have shown this in generality, so that for any given target we
can compute the transition probabilities between any two states in the
Markov chain formulation of the parameter space, the self-transition
probability can now be given as

\[ P[s \to s] = 1 - \sum_{k \text{ is a neighboring state of } s} P[s \to k] \]

Finally, given any parameter space with n parameters, we have 2^n
languages. Fixing one of them as the target language L_t, we obtain the
following procedure for constructing the corresponding Markov chain.
Note that this is the GW procedure for finding local maxima, with the
addition of a probability measure on the language family.

• (Assign distribution.) First fix a probability measure P on the
strings of the target language L_t.

• (Enumerate states.) Assign a state to each language, i.e., each L_i.

• (Normalize by the target language.) Intersect all languages with the
target language to obtain for each the language L'_i = L_i ∩ L_t. Thus
with state i associated with language L_i, we now associate the
language L'_i.

• (Take set differences.) Now for any two states i and k, if they are
more than 1 Hamming distance apart, then the transition P[i → k] = 0.
If they are 1 Hamming distance apart, then
P[i → k] = (1/n) P(L'_k \ L'_i).
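The construction above can be sketched as follows. This is a hedged illustration rather than GW's program: build_chain and the 2-parameter family TOY are hypothetical, languages are given extensionally as sets of strings, and the factor 1/n is taken from the derivation of P[s → k] earlier in this section. Self-loops are filled in as one minus the sum of the outgoing transitions.

```python
def hamming(a, b):
    """Number of positions in which two parameter arrays differ."""
    return sum(x != y for x, y in zip(a, b))

def build_chain(languages, target, prob):
    """Return {state: {state: probability}} for the TLA Markov chain.

    languages: dict mapping parameter tuples to sets of strings.
    prob: dict mapping each string of the target language to P(string).
    """
    n = len(target)
    lt = languages[target]
    # Normalize by the target: L'_i = L_i ∩ L_t.
    norm = {s: lang & lt for s, lang in languages.items()}
    chain = {}
    for i in languages:
        row = {k: 0.0 for k in languages}
        for k in languages:
            if hamming(i, k) == 1:
                # P[i -> k] = (1/n) * P(L'_k \ L'_i)
                row[k] = sum(prob[s] for s in norm[k] - norm[i]) / n
        row[i] = 1.0 - sum(row.values())   # self-transition
        chain[i] = row
    return chain

# Hypothetical 2-parameter family with target (0, 1); illustration only.
TOY = {
    (0, 0): {"a"},
    (0, 1): {"a", "b"},
    (1, 0): {"c"},
    (1, 1): {"a", "b", "c"},
}
P_UNIFORM = {"a": 0.5, "b": 0.5}
CHAIN = build_chain(TOY, (0, 1), P_UNIFORM)

if __name__ == "__main__":
    for state, row in sorted(CHAIN.items()):
        print(state, row)
```

In this toy family both (0, 1) and (1, 1) come out with self-transition probability 1, so the chain has two absorbing states.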

This model captures the dynamics of the TLA completely.

Example. 

Consider again the 3-parameter system in the previous figure with
target language 5. We can calculate the following intersections with
the target, from which the set differences follow, to build the Markov
figure straightforwardly.

1. L_1 ∩ L_5 = ∅ (no strings in common between L_1 and target L_5).

2. L_2 ∩ L_5 = {S V, S V O, S V O1 O2, S Aux V, S Aux V O,
S Aux V O1 O2}.

3. L_3 ∩ L_5 = ∅.

4. L_4 ∩ L_5 = {S V, S V O, S Aux V}.

5. L_5 ∩ L_5 = L_5.

6. L_6 ∩ L_5 = {S V, S V O, S V O1 O2, S Aux V, S Aux V O,
S Aux V O1 O2}.

7. L_7 ∩ L_5 = {S V, Adv S V}.

8. L_8 ∩ L_5 = {S V, S V O, S Aux V}.

From these values alone, we can draw the figure illustrated, and find
the local maxima. For example, since the normalized state set for state
1 is the empty set, the set difference between states 1 and 5 gives all
of the target language; so there is a (high) transition probability
from state 1 to state 5. Similarly, since states 7 and 8 share some
target language strings in common, such as S V, and do not share
others, such as Adv S V and S V O, the learner can move from state 7 to
8 and back again.

Many additional properties of the triggering learning system now become
evident once the mathematical formalization has been given. It is easy
to imagine other alternatives to the TLA that will avoid the local
maxima problem. For example, as it stands the learner only changes a
parameter setting if that change allows the learner to analyze the
sentence it could not analyze before. If we relax this condition so
that in this situation the learner picks a parameter at random to
change, then the problem with local maxima disappears, because there
can be only 1 Absorbing State, namely the target grammar. All other
states have exit arcs. Thus, by our main theorem, such a system is
learnable.
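The contrast between the greedy TLA and the relaxed learner can be illustrated on a small example. Everything in the sketch below is an assumption made for illustration: the function names, the 2-parameter family FAMILY, and the choice of a family in which no non-target language contains the whole target language (a superset language would stay absorbing under either rule, since every example already parses there).

```python
def flip(state, i):
    """Return `state` with parameter i toggled."""
    return state[:i] + (1 - state[i],) + state[i + 1:]

def build(languages, target_probs, n, greedy):
    """Transition rows for the TLA (greedy=True) or the relaxed variant
    that always keeps the randomly flipped parameter (greedy=False)."""
    rows = {}
    for s, lang in languages.items():
        row = {k: 0.0 for k in languages}
        for sent, p in target_probs.items():
            if sent in lang:
                row[s] += p                  # sentence parses: stay put
                continue
            for i in range(n):               # each parameter chosen w.p. 1/n
                k = flip(s, i)
                if greedy and sent not in languages[k]:
                    row[s] += p / n          # flip rejected by greediness
                else:
                    row[k] += p / n          # flip kept
        rows[s] = row
    return rows

def absorbing(rows):
    return sorted(s for s, row in rows.items() if abs(row[s] - 1.0) < 1e-12)

# Hypothetical family; the target is (0, 0) with language {"a", "b"}.
FAMILY = {(0, 0): {"a", "b"}, (0, 1): {"c"}, (1, 0): {"d"}, (1, 1): {"e"}}
PROBS = {"a": 0.5, "b": 0.5}

TLA_ABSORBING = absorbing(build(FAMILY, PROBS, 2, greedy=True))
RELAXED_ABSORBING = absorbing(build(FAMILY, PROBS, 2, greedy=False))

if __name__ == "__main__":
    print(TLA_ABSORBING)      # [(0, 0), (1, 1)]
    print(RELAXED_ABSORBING)  # [(0, 0)]
```

Under the greedy rule the state (1, 1) is stuck because neither neighbor parses any target sentence; under the relaxed rule it always moves, leaving the target as the only absorbing state.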

Or consider, for example, the possibility of noise: that is,
occasionally the learner gets strings that are not in the target
language. GW state (fn. 4, p. 5) that this is not a problem: the
learner need only pay attention to frequent data. But this is of course
a serious problem for the model. Unless some kind of memory or
frequency-counting device is added, the learner cannot know whether the
example it receives is noise or not. This being so, there is always
some finite probability, however small, of escaping a local maximum. It
appears that the identification in the limit framework as given is
simply incompatible with the notion of noise, unless a memory window of
some kind is added.

We  may  now  proceed  to  ask  the  following  questions 
about  the  TLA  more  precisely: 

1.  Does  it  converge? 

2. How fast does it converge? How does this vary with distributional
assumptions on the input examples?

3. Can we now compute the dynamics for other "natural" parameter
systems, like the 10-parameter system for the acquisition of stress in
languages developed by [4]?

4.  Variants  of  TLA  would  correspond  to  other  Markov 
structures.  Do  they  converge?  If  so,  how  fast? 

5.  How  does  the  convergence  time  scale  up  with  the 
number  of  parameters? 

6.  What  is  the  computational  complexity  of  learning 
parametrized  language  families? 

7.  What  happens  if  we  move  from  on-line  to  batch 
learning?  Can  we  get  PAC-style  bounds  [6]? 

8. What does it mean to have non-stationary (nonergodic) Markov
structures? How does this relate to assumptions about parameter
ordering and maturation?

9.  What  other  parametrizations  can  we  consider? 

In  the  remainder  of  this  paper  we  shall  consider  these 
and  other  questions.  We  turn  first  to  the  question  of 
convergence  and  convergence  times. 

3  Convergence  Times  for  the  Markov 
Chain  Model 

The Markov chain formulation gives us some distinct advantages in
theoretically characterizing the language acquisition problem. First,
we have already seen how, given a Markov chain, one could investigate
whether or not it has exactly one absorbing state corresponding to the
target grammar. This is equivalent to the question of whether any local
maxima exist. One could also look at other issues (like stationarity or
ergodicity assumptions) that might potentially affect convergence.
Later we will consider several variants to TLA and see how these can
all be formally analyzed within the Markov formulation. We will also
see that these variants do not suffer from the local maxima problem
associated with GW's TLA.

Perhaps the most significant advantage of the Markov chain formulation
is that it allows us to also analyze convergence times. Given the
transition matrix of a Markov chain, the problem of how long it takes
to converge has been well studied. This question is of crucial
importance in learnability. Following GW, we believe that it is not
enough to show that the learning problem is consistent, i.e., that the
learner will converge to the target in the limit. We also need to show,
as GW point out, that the learning problem is feasible, i.e., the
learner will converge in "reasonable" time. This is particularly true
in the case of finite parameter spaces, where consistency might not be
as much of a problem as feasibility. The Markov formulation allows us
to attack the feasibility question. It also allows us to clarify the
assumptions about the behavior of data and learner inherent in such an
attack. We begin by considering a few ways in which one could formulate
the question of convergence times.

3.1  Some  Transition  Matrices  and  Their 
Convergence  Curves 

Let us begin by following the procedure detailed in the
previous section to actually obtain a few transition matrices.
Consider the example which we looked at informally
in the previous section. Here the target grammar
was grammar 5 and the languages $L_i$ have already been
obtained. For simplicity, let us first assume a uniform
distribution on the strings in $L_5$, i.e., the probability the
learner sees a particular string $s_j$ in $L_5$ is 1/12 because
there are 12 (degree-0) strings in $L_5$. We can now compute
the transition matrix as the following, where 0's
occupy matrix entries if not otherwise specified:


[Transition matrix for target $L_5$; entries not recoverable from the source.]


Notice that both 2 and 5 correspond to absorbing
states; thus this chain suffers from the local maxima
problem. Note also (following the previous figure as
well) that state 4 only exits to either itself or to state
2, hence is also a local maximum. More precisely, if $T$
is the transition probability matrix of a chain, then $T_{ij}$,
i.e., the element of $T$ in the $i$th row and $j$th column, is
the probability that the learner moves from state $i$ to
state $j$ in one step. It is a well-known fact that if one
considers the corresponding $i,j$ element of $T^m$ then this
is the probability that the learner moves from state $i$
to state $j$ in $m$ steps. For learnability to hold irrespective
of which state the learner starts in, the probability
that the learner reaches state 5 should tend to 1 as $m$
goes to infinity. This means that column 5 of $T^m$ should
tend to all 1's, and the rest of the matrix should tend to 0's.
Actually we find that $T^m$ converges to the
following matrix as $m$ goes to infinity:


Examining this matrix we see that if the learner starts
out in states 2 or 4, it will certainly end up in state 2 in
the limit. These two states correspond to local maxima
grammars in the GW framework. If the learner starts in
either of these two states, it will never reach the target.
From the matrix we also see that if the learner starts in
states 5 through 8, it will certainly converge in the limit
to the target grammar.
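The limiting behavior of $T^m$ can be checked numerically by repeated multiplication. The following is a minimal sketch on a hypothetical 3-state chain (not the 8-state GW chain, whose entries are not reproduced here): state 0 plays the role of the target, state 1 of a local maximum, and state 2 of a transient state. The matrix `T` and the helper names are assumptions made for illustration only.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(t, m):
    """Return t^m, whose (i, j) entry is the m-step probability of i -> j."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, t)
    return result


# Hypothetical chain (illustration only):
# index 0 = target (absorbing), 1 = local maximum (absorbing), 2 = transient.
T = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.2, 0.1, 0.7],
]

T_limit = mat_pow(T, 200)
for row in T_limit:
    print([round(x, 4) for x in row])
```

In this toy chain the row for the transient state splits its mass between the two absorbing states, which is the same qualitative pattern as the one read off the limiting matrix above.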

The situation regarding states 1 and 3 is more interesting.
If the learner starts in either of these states, it
will reach the target grammar with probability 2/3 and
reach state 2, the other absorbing state, with probability
1/3. Thus we see that local maxima are not the only
problem for learnability. GW (p. 26 in manuscript)
focus exclusively on local maxima, and indirectly imply
that these are the only difficult states: “most of
the source grammars have local triggers that enable the
learner to get to the target... however, there exist pairs
of source and target grammars from the parameter space
given in the table in Figure 3, such that no data from
the target grammar will ever shift the learner out of the
source grammar... There are six such pairs of source local
maximum and target grammars”. They then go on
to list in their Figure 4 two such local maxima for the
target grammar 5, corresponding to states 2 and 4.

While this statement is strictly true, it does not exhaust
the set of source states from which the learner may fail to
reach the target grammar. As we see from the transition matrix,
while it is true that states 2 and 4 will, with probability 1,
not converge to the target grammar, it is also true that
states 1 and 3 are not guaranteed to converge to the target:
from either of them the learner ends up in state 2 with
probability 1/3. Thus, the
number of "bad" initial hypotheses is significantly larger
than that presented in Figure 4 of GW. This difference is
again due to the new probabilistic framework introduced
in the current paper, and in fact is related to the difficulty
found earlier with the central convergence proof:
looking just at minimal paths and cycles in fact misses
some possible learning paths. In the appendix of this paper,
we provide a complete list of all starting states which
might result in non-learnability. While the implication of
the existence of additional non-learnable starting states
is not clear, presumably the issue of learnability even in
the 3-parameter case deserves re-examination in light of
this possibility.
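The probability of eventually reaching the target from each starting state can also be computed exactly, without taking powers, by solving the linear system $h_i = \sum_j T_{ij} h_j$ with $h = 1$ at the target and $h = 0$ at every other absorbing state. The sketch below does this with exact fractions on a hypothetical 4-state chain; the matrix and all function names are illustrative assumptions, and the code assumes every non-absorbing state can reach some absorbing state.

```python
from fractions import Fraction as F


def prob_reach_target(T, target):
    """Return h, where h[i] is the probability of absorption at `target`
    when starting from state i. Assumes each non-absorbing state can reach
    some absorbing state, so the linear system is nonsingular."""
    n = len(T)
    transient = [i for i in range(n) if T[i][i] != 1]
    k = len(transient)
    # Augmented system (I - Q) h = r, with r the one-step probability
    # of jumping directly into the target.
    aug = []
    for i in transient:
        row = [(F(1) if i == j else F(0)) - T[i][j] for j in transient]
        row.append(T[i][target])
        aug.append(row)
    # Gauss-Jordan elimination in exact arithmetic.
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    h = [F(0)] * n
    h[target] = F(1)
    for idx, i in enumerate(transient):
        h[i] = aug[idx][k]
    return h


# Hypothetical chain (illustration only): 0 = target, 1 = non-target sink,
# 2 = state that can only fall into the sink, 3 = state that can go either way.
T = [
    [F(1), F(0), F(0), F(0)],
    [F(0), F(1), F(0), F(0)],
    [F(0), F(1, 2), F(1, 2), F(0)],
    [F(1, 3), F(0), F(1, 6), F(1, 2)],
]

h = prob_reach_target(T, target=0)
problem_states = [s for s in range(len(T)) if s != 0 and h[s] < 1]
print(h, problem_states)
```

Any state with $h_i < 1$ is a problem state in the sense used here, whether it is a sink, feeds only into a sink, or merely has some chance of doing so.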

Obviously one can examine other details of this particular
system. However, let us now look at a case where
there is no local maxima problem. This is the case when
the target languages have verb-second (V2) movement
in GW's 3-parameter case. Consider the transition matrix
obtained when the target language is $L_1$. Again we
assume a uniform distribution on strings of the target.


Here we find that $T^m$ does indeed converge to a matrix
with 1's in the first column and 0's elsewhere. Consider
the first column of $T^m$. It is of the form:

$$\begin{pmatrix} p_1(m) \\ p_2(m) \\ p_3(m) \\ p_4(m) \\ p_5(m) \\ p_6(m) \\ p_7(m) \\ p_8(m) \end{pmatrix}$$

Here $p_i(m)$ denotes the probability of being in state 1
at the end of $m$ examples in the case where the learner
started in state $i$. Naturally we want

$$\lim_{m \to \infty} p_i(m) = 1$$

and for this example this is indeed the case. The next
figure shows a plot of the following quantity as a function
of $m$, the number of examples.

$$p(m) = \min_i \{ p_i(m) \}$$

The quantity $p(m)$ is easy to interpret. Thus $p(m) = 0.95$
means that for every initial state of the learner the
probability that it is in the target state after $m$ examples
is at least 0.95. Further there is one initial state (the
worst initial state with respect to the target, which in our
example is $L_8$) for which this probability is exactly 0.95.
We find on looking at the curve that the learner converges
with high probability within 100 to 200 (degree-0)
example sentences, a psychologically plausible number.
(One can now of course proceed to examine actual transcripts
of child input to calculate convergence times for
“actual” distributions of examples, and we are currently
engaged in this effort.)
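The worst-case curve $p(m) = \min_i p_i(m)$ can be traced by propagating the target column of $T^m$ one step at a time. The following sketch uses a hypothetical 3-state chain with a single absorbing target; the matrix, the 0.95 level, and the function names are assumptions for illustration and do not reproduce the curve in the figure.

```python
def worst_case_curve(T, target, max_m):
    """Return [p(0), p(1), ..., p(max_m)], where p(m) is the minimum over
    start states of the probability of being in `target` after m examples."""
    n = len(T)
    # Column `target` of T^0 (the identity matrix).
    col = [1.0 if i == target else 0.0 for i in range(n)]
    curve = [min(col)]
    for _ in range(max_m):
        # Column `target` of T^(m+1) is T applied to the column of T^m.
        col = [sum(T[i][j] * col[j] for j in range(n)) for i in range(n)]
        curve.append(min(col))
    return curve


def examples_needed(curve, level):
    """Smallest m with p(m) >= level, or None if the curve never gets there."""
    for m, p in enumerate(curve):
        if p >= level:
            return m
    return None


# Hypothetical chain (illustration only): state 0 is the absorbing target.
T = [
    [1.0, 0.0, 0.0],
    [0.3, 0.7, 0.0],
    [0.0, 0.2, 0.8],
]

curve = worst_case_curve(T, target=0, max_m=200)
m95 = examples_needed(curve, 0.95)
print(m95, curve[m95])
```

Reading off the smallest $m$ at which the curve crosses a chosen level gives one concrete notion of "number of examples needed" that holds for every initial state.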

As one example of the power of this approach, we
can compare the convergence time of TLA to other algorithms.
Perhaps the simplest is random walk: start
the learner at a random point in the 3-parameter space,
and then, if an input sentence cannot be analyzed, move
randomly from state to state. Note that this regime cannot
suffer from the local maxima problem, since there
is always some finite probability of exiting a non-target
state.

To satisfy the reader's curiosity, we provide the convergence
curves for a random walk algorithm (RWA) on
the 8-state space. We find that the convergence times
are actually faster than for the TLA; see Figure 2. Since
the  RWA  is  also  superior  in  that  it  does  not  suffer  from 
the  same  local  maxima  problem  as  TLA,  the  conceptual 
support  for  the  TLA  is  by  no  means  clear.  Of  course, 
it  may  be  that  the  TLA  has  empirical  support,  in  the 
sense  of  independent  evidence  that  children  do  use  this 
procedure  (given  by  the  pattern  of  their  errors,  etc.),  but 
this  evidence  is  lacking,  as  far  as  we  know. 

Now that we have made a first attempt to quantify the
convergence time, several other questions can be raised.
How does convergence time depend upon the distribution
of the data? How does it compare with other kinds
of Markov structures with the same number of states?
How will the convergence time be affected if the number
of states increases, i.e., the number of parameters increases?
How does it depend upon the way in which
the parameters relate to the surface strings? Are there
other ways to characterize convergence times? We now
proceed to answer some of these questions.

3.2  Distributional  Assumptions 

In  the  earlier  section  we  assumed  that  the  data  was  uni¬ 
formly  distributed.  We  computed  the  transition  matrix 
for  a  particular  target  language  and  showed  that  conver¬ 
gence  times  were  of  the  order  of  100-200  samples.  In  this 
section  we  show  that  the  convergence  times  depend  cru¬ 
cially  upon  the  distribution.  In  particular  we  can  choose 
a  distribution  which  will  make  the  convergence  time  as 
large as we want. Thus the distribution-free convergence
time for the 3-parameter system is infinite.

As before, we consider the situation where the target
language is $L_1$. There are no local maxima problems
for this choice. We begin by letting the distribution be
parametrized by the variables $a, b, c, d$ where

$$\begin{aligned}
a &= P(A = \{\text{Adv V S}\}) \\
b &= P(B = \{\text{Adv V O S},\ \text{Adv Aux V S}\}) \\
c &= P(C = \{\text{Adv V O1 O2 S},\ \text{Adv Aux V O S},\ \text{Adv Aux V O1 O2 S}\}) \\
d &= P(D = \{\text{V S}\})
\end{aligned}$$

Thus each of the sets $A$, $B$, $C$ and $D$ contains different
degree-0 sentences of $L_1$. Clearly the probability of the
set $L_1 \setminus (A \cup B \cup C \cup D)$ is $1 - (a + b + c + d)$. The
elements of each defined subset of $L_1$ are equally likely
with respect to each other. Setting positive values for
$a, b, c, d$ such that $a + b + c + d < 1$ now defines a unique
probability for each degree-0 sentence in $L_1$. For example,
the probability of Adv V O S is $b/2$, the probability of
Adv Aux V O S is $c/3$, that of V O S is $(1 - (a + b + c + d))/6$
and so on.

We can now obtain the transition matrix corresponding
to this distribution. This is shown in Table 1.

Compare this matrix with that obtained with a uniform
distribution on the sentences of $L_1$ in the earlier
section. This matrix has non-zero elements (transition
probabilities) exactly where the earlier matrix had non-zero
elements. However, the value of each transition
probability now depends upon $a$, $b$, $c$, and $d$. In particular
if we choose $a = 1/12$, $b = 2/12$, $c = 3/12$, $d = 1/12$
(this is equivalent to assuming a uniform distribution)
we obtain the appropriate transition matrix as before.
Looking more closely at the general transition matrix,
we see that the transition probability from state 2 to
state 1 is $(1 - (a + b + c))/3$. Clearly if we make $a$ arbitrarily
close to 1, then this transition probability is arbitrarily
close to 0 so that the number of samples needed
to converge can be made arbitrarily large. Thus choosing
large values for $a$ and small values for $b$ will result in
large convergence times.

This means that the sample complexity cannot be
bounded in a distribution-free sense, because by choosing
a highly unfavorable distribution the sample complexity
can be made as high as desired. For example,
we now give the convergence curves calculated for
different choices of $a, b, c, d$. We see that for a uniform
distribution the convergence occurs within 200 samples.
By choosing a distribution with $a = 0.9999$ and
$b = c = d = 0.000001$, the convergence time can be
pushed up to as much as 50 million samples. (Of course,
this distribution is presumably not psychologically realistic.)
For $a = 0.99$, $b = c = d = 0.0001$, the sample
complexity is on the order of 100,000 positive examples.

3.3  Absorption  Times 

In the previous sections, we computed the transition matrix
for a variety of distributions and showed the rate of
convergence. In particular we plotted $p(m)$ (the probability
of converging from the most unfavorable initial
state) against $m$ (the number of samples). However, this
is not the only way to characterize convergence times.
Given an initial state, the time taken to reach the absorption
state (known as the absorption time) is a random
variable. One can compute the mean and variance
of this random variable. For the case when the target
language is $L_1$, we have seen that the transition matrix
has the form:


Here $Q$ is a 7-dimensional square matrix. The mean
absorption times from states 2 through 8 are given by the
vector (see Isaacson and Madsen [3])

$$\mu = (I - Q)^{-1} \mathbf{1}$$

where $\mathbf{1}$ is a 7-dimensional column vector of ones. The
vector of second moments is given by

$$\mu' = (I - Q)^{-1} (2\mu - \mathbf{1}).$$

Using this result, we can now compute the mean and
standard deviation of the absorption time from the most
unfavorable initial state of the learner. (We note that
the distribution of the absorption time is fairly skewed in
such cases and so is not symmetric about the mean, as may
be seen from the previous curves.)
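The two formulas above translate directly into two linear solves against $I - Q$. The sketch below is a minimal pure-Python version on small hypothetical matrices $Q$ (the 7-state $Q$ for $L_1$ is not reproduced here); the solver and the function names are illustrative assumptions rather than the procedure actually used to produce the table.

```python
import math


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def absorption_time_stats(Q):
    """Mean and standard deviation of the absorption time from each
    transient state, given the transient-to-transient block Q."""
    n = len(Q)
    i_minus_q = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)]
                 for i in range(n)]
    mean = solve(i_minus_q, [1.0] * n)                        # (I - Q)^-1 1
    second = solve(i_minus_q, [2.0 * m - 1.0 for m in mean])  # (I - Q)^-1 (2 mu - 1)
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    return mean, std


# Hypothetical one-state example: leave with probability 1/2 at every step,
# so the absorption time is geometric.
mean1, std1 = absorption_time_stats([[0.5]])

# Hypothetical two-state example (illustration only).
mean2, std2 = absorption_time_stats([[0.7, 0.0], [0.2, 0.8]])
print(mean1, std1, mean2, std2)
```

The one-state case is a useful sanity check, since the mean and variance of a geometric waiting time are known in closed form.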
Learning scenario     Mean abs. time     St. dev. of abs. time
TLA (uniform)         34.8               22.3
TLA (a = 0.99)        45000              33000
TLA (a = 0.9999)      4.5 x 10^6         3.3 x 10^6
RW                    9.6                10.1

3.4  Eigenvalue  Rates  of  Convergence 

In classical Markov chain theory, there are also well-known
convergence theorems derived from a consideration
of the eigenvalues of the transition matrix. We
state without proof a convergence result for transition
matrices stated in terms of their eigenvalues.

Theorem 3 Let $T$ be an $n \times n$ transition matrix with
$n$ linearly independent left eigenvectors $x_1, \ldots, x_n$
corresponding to eigenvalues $\lambda_1, \ldots, \lambda_n$. Let $x_0$ (an
$n$-dimensional vector) represent the starting probability of
being in each state of the chain and $\pi$ be the limiting
probability of being in each state. Then after $k$ transitions,
the probability of being in each state, $x_0 T^k$, can be
described by

$$\| x_0 T^k - \pi \| = \Big\| \sum_{i=2}^{n} \lambda_i^k \, x_0 y_i x_i \Big\| \le \max_{2 \le i \le n} |\lambda_i|^k \sum_{i=2}^{n} \| x_0 y_i x_i \|$$

where the $y_i$ are the right eigenvectors of $T$.

This theorem thus bounds the rate of convergence to
the limiting distribution $\pi$ (in cases where there is only
one absorption state, $\pi$ will have a 1 corresponding to
that state and 0 everywhere else). Using this result we
can now bound the rates of convergence (in terms of
number $k$ of samples) by:


Learning scenario     Rate of convergence
TLA (uniform)         O(0.94^k)
TLA (a = 0.99)        O((1 - 10^-4)^k)
TLA (a = 0.9999)      O((1 - 10^-6)^k)
RW                    O(0.89^k)

This theorem also helps us to see the connection between
the number of examples and the number of parameters
since a chain with $n$ states (corresponding to
an $n \times n$ transition matrix) represents a language family
with $\log_2(n)$ parameters.
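When the chain has a single absorbing target, the transition matrix is block triangular, so its eigenvalues are 1 together with the eigenvalues of the transient block $Q$. The geometric rate in Theorem 3 is then governed by the largest eigenvalue modulus of $Q$. The sketch below estimates that rate without an eigenvalue routine, by watching how fast the probability of not yet being absorbed shrinks from one step to the next. This is an illustrative shortcut, not the computation behind the table: it assumes the dominant eigenvalue of $Q$ is real and simple, and the matrix and names are hypothetical.

```python
def estimate_rate(Q, steps=200):
    """Estimate the geometric convergence rate of an absorbing chain from
    its transient block Q, as the limiting ratio of successive worst-case
    probabilities of not yet being absorbed."""
    n = len(Q)
    tail = [1.0] * n  # tail[i] = P(not absorbed after k steps | start in i)
    ratio = None
    for _ in range(steps):
        new_tail = [sum(Q[i][j] * tail[j] for j in range(n)) for i in range(n)]
        if max(tail) == 0.0:
            break  # everything has been absorbed (or underflowed)
        ratio = max(new_tail) / max(tail)
        tail = new_tail
    return ratio


# Hypothetical upper-triangular Q, so its eigenvalues are the diagonal
# entries 0.5 and 0.8 (illustration only).
Q = [
    [0.5, 0.25],
    [0.0, 0.8],
]

rate = estimate_rate(Q)
print(rate)
```

A rate close to 1 means the unabsorbed probability decays slowly, which is the situation produced by the unfavorable distributions in the table.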

4  Batch  Learning  Upper  and  Lower 
Bounds:  An  Aside 

So far we have discussed a memoryless learner moving
from state to state in parameter space and hopefully converging
to the correct target in finite time. As we saw,
this was well-modeled by our Markov formulation. In
this section however we step back and consider upper
and lower bounds for learning finite language families if
the learner was allowed to remember all the strings encountered
and optimize over them. Needless to say this
might not be a psychologically plausible assumption, but
it can shed light on the information-theoretic complexity
of the learning problem.

Consider a situation where there are $n$ languages
$L_1, L_2, \ldots, L_n$ over an alphabet $\Sigma$. Each language can
be represented as a subset of $\Sigma^*$, i.e.

$$L_i = \{\omega_{i1}, \omega_{i2}, \ldots\}; \quad \omega_{ij} \in \Sigma^*$$

The learner is provided with positive data (strings that
belong to the language) drawn according to a distribution
$P$ on the strings of a particular target language $L_t$.
The learner is to identify the target. It is quite possible
that the learner receives strings that are in more than
one language. In such a case the learner will not be
able to uniquely identify the target. However, as more
and more data becomes available, the probability of having
received only ambiguous strings becomes smaller and
smaller and eventually the learner will be able to identify
the target uniquely. An interesting question to ask then
is how many samples the learner needs to see so that
with high confidence it is able to identify the target, i.e.
the probability that after seeing that many samples, the
learner is still ambiguous about the target is less than $\delta$.
The following theorem provides a lower bound.

Theorem 4 The learner needs to draw at least

$$M = \max_{j \ne t} \frac{1}{\ln(1/p_j)} \ln(1/\delta)$$

samples (where $p_j = P(L_t \cap L_j)$)
in order to be able to identify the target with confidence
greater than $1 - \delta$.

Proof. Suppose the learner draws $m$ (less than
$M$) samples. Let $k = \arg\max_{j \ne t} p_j$. This means 1)
$M = \frac{1}{\ln(1/p_k)} \ln(1/\delta)$ and 2) that with probability $p_k$
the learner receives a string which is in both $L_k$ and
$L_t$. Hence it will be unable to discriminate between
the target and the $k$th language. After drawing $m$ samples,
the probability that all of them belong to the set
$L_t \cap L_k$ is $(p_k)^m$. In such a case even after seeing $m$
samples, the learner will be in an ambiguous state. Now
$(p_k)^m > (p_k)^M$ since $m < M$ and $p_k < 1$. Finally
since $M \ln(1/p_k) = \ln((1/p_k)^M) = \ln(1/\delta)$, we see that
$(p_k)^m > \delta$. Thus the probability of being ambiguous after
$m$ examples is greater than $\delta$, which means that the
confidence of being able to identify the target is less than
$1 - \delta$. ∎

This  simple  result  allows  us  to  assess  the  number  of 
samples  we  need  to  draw  in  order  to  be  confident  of  cor¬ 
rectly  identifying  the  target.  Note  that  if  the  distribution 
of  the  data  is  very  unfavorable,  that  is,  the  probability 
of  receiving  ambiguous  strings  is  quite  high,  then  the 
number  of  samples  needed  can  actually  be  quite  large. 
While  the  previous  theorem  provides  the  number  of  sam¬ 
ples  necessary  to  identify  the  target,  the  following  theo¬ 
rem  provides  an  upper  bound  for  the  number  of  samples 
that  are  sufficient  to  guarantee  identification  with  high 
confidence. 

Theorem 5 If the learner draws more than

$$M = \frac{1}{\ln(1/(1 - b_t))} \ln(1/\delta)$$

samples, then it will identify the target
with confidence greater than $1 - \delta$. (Here $b_t =
P(L_t \setminus \bigcup_{j \ne t} L_j)$.)

Proof. Consider the set $L = L_t \setminus \bigcup_{j \ne t} L_j$. Any element
of this set is present in the target language $L_t$ but
not in any other language. Consequently upon receiving
such a string, the learner will be able to instantly identify
the target. After $m > M$ samples, the probability
that the learner has not received any member of this set
is $(1 - P(L))^m = (1 - b_t)^m < (1 - b_t)^M = \delta$. Hence
the probability of seeing some member of $L$ in those $m$
samples is greater than $1 - \delta$. But seeing such a member
enables the learner to identify the target, so the probability
that the learner is able to identify the target is
greater than $1 - \delta$ if it draws more than $M$ samples. ∎
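Both bounds are simple closed forms once the overlap probabilities are known. The following sketch evaluates them for a hypothetical toy family; the strings, the languages, the distribution, and the function name `sample_bounds` are all assumptions for illustration. Languages are given as Python sets and the distribution on target strings as a dictionary.

```python
import math


def sample_bounds(P, others, delta):
    """Return (lower, upper) sample bounds for identifying the target.

    P      : dict mapping each target string to its probability
    others : list of sets, the non-target languages L_j
    delta  : allowed probability of remaining ambiguous
    """
    # p_j = P(L_t intersect L_j): mass of target strings also in L_j.
    p = [sum(pr for s, pr in P.items() if s in lang) for lang in others]
    # b_t = P(L_t minus the union of all other languages).
    union = set().union(*others)
    b_t = sum(pr for s, pr in P.items() if s not in union)

    p_max = max(p) if p else 0.0
    if p_max <= 0.0:
        lower = 0.0          # no ambiguous strings at all
    elif p_max >= 1.0:
        lower = math.inf     # some language contains every target string
    else:
        lower = math.log(1.0 / delta) / math.log(1.0 / p_max)

    if b_t <= 0.0:
        upper = math.inf     # no string ever disambiguates the target
    elif b_t >= 1.0:
        upper = 0.0          # every string disambiguates immediately
    else:
        upper = math.log(1.0 / delta) / math.log(1.0 / (1.0 - b_t))
    return lower, upper


# Hypothetical toy family (illustration only): four equally likely target
# strings, two competing languages that each share two of them.
P = {"s1": 0.25, "s2": 0.25, "s3": 0.25, "s4": 0.25}
others = [{"s1", "s2"}, {"s2", "s3"}]

lower, upper = sample_bounds(P, others, delta=0.05)
print(lower, upper)
```

Since the probability of drawing a string shared with some other language is at least as large as the probability of drawing one shared with any single language, the upper bound is never smaller than the lower bound.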

To summarize, this section provides a simple upper
and lower bound on the sample complexity of exact identification
of the target language from positive data. The
$\delta$ parameter that measures the confidence of the learner
of being able to identify the target is suggestive of a
PAC [6] formulation. However there is a crucial difference.
In the PAC formulation, one is interested in an
$\epsilon$-approximation to the target language with at least $1 - \delta$
confidence. In our case, this is not so. Since we are not
allowed to approximate the target, the sample complexity
shoots up with choice of unfavorable distributions.
There are some interesting directions one could follow
within this batch learning framework. One could try
to get true PAC-style distribution-free bounds for various
kinds of language families. Alternatively one could
use the exact identification results here for linguistically
plausible language families with "reasonable" probability
distributions on the data. It might be an interesting
exercise to recompute the bounds for cases where the
learner receives both positive and negative data. Finally
the bounds obtained here could be sharpened further.
We intend to look into some of these questions in the
future.

5  Variants  of  the  Learning  Model 

We have so far focused on the TLA scheme for learning.
TLA observes the single value and greediness constraints.
There could be several variants of this learning
algorithm and many of these are captured completely
by our Markov formulation. We consider the following
three simple variants by dropping either or both of the
Single Value and Greediness constraints.

Random walk with neither greediness nor single
value constraints: We have already seen this example
before. The learner is in a particular state. Upon
receiving a new sentence, it remains in that state if the
sentence is analyzable. If not, the learner moves uniformly
at random to any of the other states and stays
there waiting for the next sentence. This is done without
regard to whether the new state allows the sentence to
be analyzed.

Random walk with no greediness but with single
value constraint: The learner remains in its original
state if the new sentence is analyzable. Otherwise, the
learner chooses one of the parameters uniformly at random
and flips it, thereby moving to an adjacent state in
the Markov structure. Again this is done without regard
to whether the new state allows the sentence to be analyzed.
However since only one parameter is changed at
a time, the learner can only move to neighboring states
at any given time.

Random walk with no single value constraint but
with greediness: The learner remains in its original
state if the new sentence is analyzable. Otherwise the
learner moves uniformly at random to any of the other
states and stays there iff the sentence can be analyzed.
If the sentence cannot be analyzed in the new state the
learner remains in its original state.
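The three variants, together with the TLA itself, differ only in two switches, so they can be written as one update rule. The sketch below is a hypothetical rendering of the verbal descriptions above: a state is a tuple of binary parameters, and `analyzable` is a stand-in predicate (here a sentence is simply represented by the set of states that can analyze it), since the actual grammars are not reproduced. Setting both flags to true corresponds to the TLA.

```python
import itertools
import random


def all_states(n):
    """All 2^n settings of n binary parameters."""
    return list(itertools.product((0, 1), repeat=n))


def learner_step(state, sentence, analyzable, rng, single_value, greedy):
    """One update of the learner on one input sentence.

    single_value : if True, flip exactly one randomly chosen parameter;
                   otherwise jump uniformly to any other state.
    greedy       : if True, keep the move only when the new state can
                   analyze the sentence; otherwise always keep it.
    """
    if analyzable(state, sentence):
        return state  # no error, no change
    if single_value:
        i = rng.randrange(len(state))
        candidate = state[:i] + (1 - state[i],) + state[i + 1:]
    else:
        candidate = rng.choice(
            [s for s in all_states(len(state)) if s != state])
    if greedy and not analyzable(candidate, sentence):
        return state
    return candidate


# Hypothetical stand-in for parsing (illustration only): a "sentence" is
# represented by the set of parameter settings that can analyze it.
def analyzable(state, sentence):
    return state in sentence


rng = random.Random(0)
sentence = {(0, 1, 0), (0, 1, 1)}
start = (1, 0, 1)

tla_next = learner_step(start, sentence, analyzable, rng, True, True)
walk_next = learner_step(start, sentence, analyzable, rng, False, False)
print(tla_next, walk_next)
```

In this toy setting no single flip from the start state reaches a state that analyzes the sentence, so the TLA-style learner cannot move on it, while the variants without the single value constraint can.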

Fig. 1 shows the convergence times for these three algorithms
when $L_1$ is the target language. Interestingly,
all three perform better than the TLA for this task. Further,
they do not suffer from local maxima problems. It
should be pointed out, however, that the differences from
TLA are marginal and this convergence has been shown
only for $L_1$ as the target language. Ideally the convergence
rates have to be computed for each target language
and then either a worst case or average case rate should
be decided upon to characterize the convergence times
for the algorithm on the language family as a whole.

6  Conclusion,  Open  Questions,  and 
Future  Directions 

As the number of parameters $n$ increases, the size of the
corresponding Markov matrix grows as $2^n$. Thus in the
case of a 10-parameter system as found in models of English
stress ([4]) the corresponding Markov structure will
be a 1024 x 1024 matrix. We are currently conducting
an analysis of this larger system to find its local maxima,
analyze its convergence times, and see if its convergence
times correspond to what one might find in practice with
real stress systems.

Additional questions remain to be answered. One issue
has to do with the "smoothness" relation between
the parameter settings and the resulting surface strings.
In principles-and-parameters theory, it has often been
suggested that a small parameter change could lead to
a large deductive change in the grammar, hence a large
change in the surface language generated. In all the examples
considered so far there is a smooth relation between
surface sentences and parameters, in that switching
from a V2 to a non-V2 system, for instance, leads
us to a Markov state that is not too far away from the
previous one. If this is not so, it is not so clear that
the TLA will work as before. In fact, the whole question
of how to formulate the notion of "smoothness" in
a language grammar framework is unclear. We know
in the case of continuous functions, for example, that
if the learner is allowed to choose examples (which can
be simulated by selective attention), then such an "active"
learner can approximate such functions much more
quickly than a "passive" learner, like the one presented
in GW. Is there an analog to this in the discrete, digital
domain of language? How can one approximate a language?
Here too mathematics may play a helpful role.
Recall that there is an analog to a functional analysis
of languages, namely, the algebraic approach advanced
by Chomsky and Schützenberger ([5]). In this model, a
language is described by an (infinite) polynomial generating
function, where the coefficient on the polynomial
term $x$ gives the number of ways of deriving the string
$x$. A (weak, string) approximation to a language can
then be defined in terms of an approximation to the
generating function. If this method can be deployed,
then one might be able to carry over the results of functional
analysis and approximation for active vs. passive
learners into the "digital" domain of language. If this
is possible, we would then have a very powerful set of
previously underutilized mathematical tools to analyze
language learnability.
Appendix

A  Learnable  Grammars:  The  Full  Story 

A.l  Problem  States 

We  provide  in  Table  2  a  complete  list  of  problem  states. 
In  other  words  we  list  all  the  initial  starting  grammar- 
target  grammar  pairs  for  which  the  learner  is  not  guar¬ 
anteed  to  converge  to  the  target  with  probability  1.  In 
fact,  assuming  a  uniform  distribution  on  the  strings  for 
the  target  grammar,  it  is  possible  to  compute  the  prob¬ 
ability  of  not  converging  to  the  target  for  each  of  these 
pairs.  Note  that  this  probability  is  non-zero  for  the  pairs 
listed. 

A. 2  Remarks 

1.  We have provided a complete list of initial starting
grammars from which some target is not learnable
(i.e., not learnable with probability 1). We notice
that there are three kinds of such problem
starting states. Some states correspond to sinks
in the Markov structure with respect to some target
grammar. Here the learner gets stuck, never
leaves the sink, and correspondingly never converges to
the target. Then there are states which are not
sinks (OVS+V2 when the target is SVO-V2) but
which can only move to some non-target sink, and
so never converge to the target. These two kinds
of problem states (starred in our table) have been
listed by Gibson and Wexler in Fig. 4 (pg. 27 of
manuscript). Finally there are states which are not
sinks, but which can with a non-zero probability
converge to some non-target sink. They can also
with a non-zero probability converge to the target
and in this respect are distinguished from problem
states of type 2.

2.  We would like to observe that of the 56 possible
initial grammar-target grammar combinations,
12 result in non-learnable situations in the
3-parameter system investigated here. This is a fairly
high density of unfavourable initial configurations.
It would be interesting to see how this changes with
other linguistic subsystems with a larger number of
parameters.

3.  We also did an analysis of convergence times under
a uniform distribution for each target grammar.
We find that the results are similar to the results
displayed in the paper for the case when the target
grammar is (VOS-V2). For cases when the target
is learnable, the learner converges to the target
in 100-200 samples with high (greater than 0.99)
probability. Further, the variants of the TLA all
outperform the TLA in terms of convergence times.
Figure 1: The 8 parameter settings in the GW example, shown as a Markov structure, with transition probabilities
omitted. (Without transition probabilities, this diagram corresponds exactly to that in GW's appendix, as mentioned
above.) Directed arrows between circles (states) represent possible nonzero (possible learner) transitions. The target
grammar (in this case, number 5, setting [0 1 0]) lies at dead center. Around it are the three settings that differ
from the target by exactly one binary digit; surrounding those are the 3 hypotheses two binary digits away from the
target; the third ring out contains the single hypothesis that differs from the target by 3 binary digits. Note that
the learner can either cycle or step in or out one ring (binary digit) at a time, according to the single-step learning
hypothesis; but some transitions are not possible because there is no data to drive the learner from one state to the
other under the TLA.


Table 1: Transition matrix corresponding to a parametrized choice for the distribution on the target strings. In this
case the target is L1 and the distribution is parametrized according to Section 3.2.

Figure 2: Convergence as a function of the number of examples. The horizontal axis denotes the number of examples
received and the vertical axis represents the probability of converging to the target state. The data from the target
is assumed to be distributed uniformly over degree-0 sentences. The solid line represents TLA convergence times
and the dotted line is a random walk learning algorithm (RWA). Note that the random walk actually converges faster
than the TLA in this case.

Figure 3: Rates of convergence for TLA with L1 as the target language for different distributions. The y-axis plots the
probability of converging to the target after m samples and the x-axis is on a log scale, i.e., it shows log(m) as m varies.
The solid line denotes the choice of an "unfavorable" distribution characterized by a = 0.9999; b = c = d = 0.000001.
The dotted line denotes the choice of a = 0.99; b = c = d = 0.0001 and the dashed line is the convergence curve for
a uniform distribution, the same curve as plotted in the earlier figure.

Figure 4: Convergence rates for different learning algorithms when L1 is the target language. The curve with the
slowest rate (large dashes) represents the TLA. The curve with the fastest rate (small dashes) is the Random Walk
(RWA) with no greediness or single value constraints. Random walks with exactly one of the greediness and single
value constraints have performances in between these two and are very close to each other.


Initial Grammar   Target Grammar   State of Initial Grammar   Probability of Not
                                   (Markov Structure)         Converging to Target

(SVO-V2)          (OVS-V2)         Not Sink                   0.5
(SVO+V2)*         (OVS-V2)         Sink                       1.0
(SOV-V2)          (OVS-V2)         Not Sink                   0.15
(SOV+V2)*         (OVS-V2)         Sink                       1.0
(VOS-V2)          (SVO-V2)         Not Sink                   0.33
(VOS+V2)*         (SVO-V2)         Sink                       1.0
(OVS-V2)          (SVO-V2)         Not Sink                   0.33
(OVS+V2)*         (SVO-V2)         Not Sink                   1.0
(VOS-V2)          (SOV-V2)         Not Sink                   0.33
(VOS+V2)*         (SOV-V2)         Sink                       1.0
(OVS-V2)          (SOV-V2)         Not Sink                   0.08
(OVS+V2)*         (SOV-V2)         Sink                       1.0

Table 2: Complete list of problem states, i.e., all combinations of starting grammar and target grammar which result
in non-learnability of the target. The items marked with an asterisk are those listed in the original paper by Gibson
and Wexler [1].
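
The last column of Table 2 is an absorption probability of a Markov chain. As a hedged sketch of how such a quantity can be computed, the code below iterates the hitting-probability equations on a small chain and labels each state as a sink or not. The 4-state matrix `P_ABSORB` and the function names are hypothetical and serve only as an illustration; the sketch does not reproduce the transition probabilities behind Table 2.

```python
# Toy illustration only: a hypothetical 4-state chain, not the paper's
# transition matrix. Row i holds the probabilities of moving out of state i.
P_ABSORB = [
    [1.0, 0.0, 0.0, 0.0],    # 0: target (absorbing)
    [0.0, 1.0, 0.0, 0.0],    # 1: non-target sink (absorbing)
    [0.25, 0.25, 0.5, 0.0],  # 2: non-sink that can reach either sink
    [0.0, 0.5, 0.0, 0.5],    # 3: non-sink that can only reach the non-target sink
]
TARGET = 0


def is_sink(P, i):
    """A state is a sink if the learner stays in it with probability 1."""
    return P[i][i] == 1.0


def prob_not_converging(P, target, sweeps=1000):
    """Per start state, the probability of never reaching `target`.

    After k sweeps, reach[i] is the probability of hitting the target
    within k steps; this increases to the eventual hitting probability.
    """
    n = len(P)
    reach = [1.0 if i == target else 0.0 for i in range(n)]
    for _ in range(sweeps):
        reach = [
            1.0 if i == target else sum(P[i][j] * reach[j] for j in range(n))
            for i in range(n)
        ]
    return [1.0 - r for r in reach]


if __name__ == "__main__":
    failure = prob_not_converging(P_ABSORB, TARGET)
    for i, p in enumerate(failure):
        label = "Sink" if is_sink(P_ABSORB, i) else "Not Sink"
        print(i, label, round(p, 2))
```

In this toy chain, state 3 is not a sink yet fails to reach the target with probability 1, which is the same pattern as the (OVS+V2)* row with target (SVO-V2) in Table 2. A fixed number of sweeps is used for simplicity; solving the linear system exactly would be an alternative design.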